1 INTRODUCTION



Modern sequencing technology is providing an increasingly detailed picture of the distribution of genes across
a wide array of taxa. Some molecular biologists have used these data to argue that unless ancestral genomes
were considerably larger than present-day ones, extensive lateral gene transfer (LGT) must be invoked to explain
the current distribution of genes [1], [2], [32]. LGT is a process by which a gene (or genes) from one species is
transferred into the genome of another species by various genetic mechanisms. The extent of LGT is controversial,
but it has been argued to be widespread in prokaryotes (e.g. bacteria) and during the earlier epochs of evolution,
suggesting in turn that a network, rather than a tree, best describes the evolution of life [4].

Although the pattern of presence and absence of different genes across a set of species can suggest that LGT
events occurred in the evolution of these species, another explanation is that certain genes are simply lost in
different lineages. As a result, various attempts to quantify the extent of LGT based on gene content have been
developed, typically based either on most-parsimonious scenarios or on stochastic models of gene genesis, loss and
transfer (see, for example, [1], [10], [13]). Attempts to reconstruct evolutionary histories under the assumption
that no LGT events have occurred (and that genes arise just once) imply that some common ancestors of the
considered species must have had far more genes than their current-day descendants. Doolittle et al. [5] refer to
such an unlikely all-encompassing ancestral genome as the 'genome of Eden' hypothesis. Allowing LGT events
reduces the need for genes to be present in ancestral species, as illustrated for a single gene in Fig. 1.

Figure 1. The dilemma of ancestral genome inflation: If gene g, distributed as shown, is not transferred laterally
then under the model, g must be in five ancestral genomes (*,+) not just at +.

In this paper, we exploit the combinatorial structure that underlies a key biological insight on which a recent
heuristic analysis of data by [1] was based (see also [2], [32]). This insight is that simple models of gene evolution,
in which a gene typically arises just once (gene genesis) but can be lost multiple times, imply lower bounds on
the extent of LGT simply to prevent hypothetical ancestral genomes from becoming unfeasibly large. For such
a model, we aim to bound the number of gene transfer events that have occurred in the evolution of a set of
taxa, based on the presence/absence patterns of genes in each of these taxa, assuming that ancestral genomes are
bounded by a given size.

Notice that we wish to count transfer events (rather than the total number of genes that are transferred),
since in each transfer event, several genes may be transferred from one species into another. Thus our count of
LGTs is conservative, and recognizes that genes are not independently transferred and that a transfer event may
insert a section of the genome (with several genes) into an individual organism of a different species.

The structure of this paper is as follows. In the next section, we define the model of gene genesis, loss and 
transfer precisely, and summarize our main results. We then provide proofs of these results in subsequent sections, 
and end with some concluding comments and a conjecture. 

2 MATHEMATICAL MODEL AND SUMMARY OF MAIN RESULTS 
2.1 Definitions and model specification 

We begin by recalling some notation concerning digraphs, and phylogenetic trees and networks.

Let v be a vertex of a digraph D. The indegree of v is the number of arcs directed into v, while the outdegree
of v is the number of arcs directed out of v. The indegree of v is denoted by $d^-(v)$ and the outdegree of v is denoted
by $d^+(v)$. The degree of v is $d^-(v) + d^+(v)$. Furthermore, u is an in-neighbour of v if (u, v) is an arc in D, while
w is an out-neighbour of v if (v, w) is an arc in D. A digraph D is rooted if there exists a vertex, p say, of indegree
zero such that, for each vertex v in D, there exists a directed path from p to v.

Throughout the paper, X will denote a finite set of taxa and Q will denote a finite set of genes. A phylogenetic
tree (on X) is a rooted tree whose root has degree at least two and all other internal vertices have degree at least
three, and whose leaf set is X. More generally, a phylogenetic network N (on X) is a rooted acyclic digraph with
the following properties:

(i) the root has outdegree at least two and, for all vertices v with $d^+(v) = 1$, we have $d^-(v) \geq 2$; and

(ii) the set of vertices of outdegree zero is X.

The elements of X are the leaves of N. For a subset U of the vertex set of N, the sub-digraph of N = (V, A)
induced by U is the digraph whose vertex set is U, and whose arc set is the subset $\{(u, v) : u, v \in U \text{ and } (u, v) \in A\}$
of A.

We now describe the model of gene genesis, loss, and transfer. For each taxon $x \in X$, assume that the subset
G(x) of Q consisting of the genes in Q that have been observed in taxon x is known. We refer to the associated
map $G : X \to 2^Q$ as a genome assignment. Let N = (V, A) be a phylogenetic network on X. For a fixed positive
integer k, and a genome assignment $G : X \to 2^Q$, a (G, k)-gene labelling of N is a mapping $F : V \to 2^Q$ such that
the following hold:

(I) $F(x) = G(x)$ for each $x \in X$;

(II) $|F(v)| \leq k$ for all $v \in V$;

(III) For each gene $g \in Q$, the sub-digraph of N induced by $\{v \in V : g \in F(v)\}$ is rooted (and therefore connected).

Note that if $x \in X$ and $|G(x)| > k$, then N has no (G, k)-labelling. If N has a (G, k)-labelling, we say that
N exhibits such a labelling. A gene labelling describes a possible evolution of the genes observed in the taxa
under consideration. Property (I) says that each leaf of the network is labelled by the set of genes observed in
the corresponding taxon. Property (II) demands that each vertex is labelled by a set of at most k genes; the
parameter k thus bounds the sizes of the ancestral genomes. Lastly, property (III) means that each gene in Q is
created at most once. There is no restriction on the number of times a gene is lost.
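
Properties (I)-(III) can be checked mechanically for a candidate labelling. The following is a minimal sketch, not part of the original paper: it assumes the network is given as a list of arcs of an acyclic digraph, and the names `is_gene_labelling`, `arcs`, `leaves`, `G` and `F` are illustrative choices. It relies on the fact that, in an acyclic digraph, a non-empty induced sub-digraph is rooted exactly when it has a unique vertex of indegree zero.

```python
def is_gene_labelling(arcs, leaves, G, F, k):
    """Return True if F satisfies properties (I)-(III) on the acyclic network given by arcs."""
    vertices = {v for arc in arcs for v in arc}
    # (I): each leaf is labelled by its observed genome.
    if any(F[x] != G[x] for x in leaves):
        return False
    # (II): no genome holds more than k genes.
    if any(len(F[v]) > k for v in vertices):
        return False
    # (III): in an acyclic digraph, a non-empty induced sub-digraph is rooted
    # exactly when it has a unique vertex of indegree zero.
    for g in set().union(*(F[v] for v in vertices)):
        carriers = {v for v in vertices if g in F[v]}
        with_parent = {v for (u, v) in arcs if u in carriers and v in carriers}
        if len(carriers - with_parent) != 1:
            return False
    return True


# Hypothetical toy network: root r, internal vertex a, leaves x, y, z.
arcs = [("r", "a"), ("r", "z"), ("a", "x"), ("a", "y")]
leaves = ["x", "y", "z"]
G = {"x": {1, 2}, "y": {1}, "z": {2}}
F = {"r": {2}, "a": {1, 2}, "x": {1, 2}, "y": {1}, "z": {2}}
print(is_gene_labelling(arcs, leaves, G, F, 2))
```

In this toy example gene 2 occurs on both sides of the root, so it must be present at r; lowering k to 1 then violates (II) at the vertex a.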

Any function F which satisfies properties (I) and (III) we will call a G-gene labelling. With these definitions 
in hand we can now state the main results of this paper. 

2.2 Bounding the number of gene transfers required 

Our first result establishes lower and upper bounds on the number of LGT events required to explain a given
data set. Suppose our input is given by a rooted phylogenetic tree T on X ("species tree"), a genome assignment
$G : X \to 2^Q$, and a positive integer k. Given a phylogenetic network N, we say that N can be obtained from T
by adding h arcs, if there is a subgraph T' of N that is a subdivision of T (i.e. T' can be obtained from T by
replacing arcs by directed paths) and h arcs of N are not arcs of T'. Here, one views these added arcs as LGT
events.

We are interested in the minimum number of LGT events that must be added to T in order for the resulting
network to exhibit a (G, k)-gene labelling. We denote this minimum number by $\ell(T, G, k)$. Given the above input,
Theorem 1 provides lower and upper bounds for $\ell(T, G, k)$. For a vertex v of T, let n(v) denote the number of
genes $g \in Q$ for which there exist two leaves $x_1, x_2 \in X$ such that $g \in G(x_1)$, $g \in G(x_2)$ and the most recent
common ancestor of $x_1$ and $x_2$ in T is v.

Theorem 1. Let T = (V, E) be a rooted phylogenetic tree on X, let Q be a set of genes, let $G : X \to 2^Q$ be a
genome assignment, and let k be a positive integer. Then:

(i) $\ell(T, G, k) \geq \sqrt{\tfrac{2}{3}\,|\{v \in V : n(v) > k\}|}$.

(ii) $\ell(T, G, k) \leq \left\lceil \frac{|Q| - k}{k} \right\rceil \cdot |X|$.

The proof of Theorem 1 is given in Section 3.
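
To make the quantities in Theorem 1 concrete, the following minimal sketch computes n(v) and evaluates both bounds on a small tree. It is an illustration under stated assumptions rather than the authors' implementation: the tree is encoded by a hypothetical child-to-parent map `parent`, the gene set Q is taken to be the union of the observed genomes, part (ii) is read as ceil((|Q| - k)/k) * |X|, and every observed genome is assumed to have at most k genes (otherwise no labelling exists at all).

```python
from itertools import combinations
from math import ceil, sqrt


def ancestors(parent, v):
    """Vertices on the path from v up to the root, starting with v itself."""
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def mrca(parent, x, y):
    """Most recent common ancestor of x and y in the tree."""
    above_x = set(ancestors(parent, x))
    return next(v for v in ancestors(parent, y) if v in above_x)


def theorem1_bounds(parent, G, k):
    """Return (n, lower, upper): the map n(v) and the two bounds of Theorem 1."""
    genes = set().union(*G.values())
    n = {}
    for g in genes:
        holders = [x for x in G if g in G[x]]
        # Each gene is counted once per vertex that is the MRCA of some pair of holders.
        for v in {mrca(parent, x, y) for x, y in combinations(holders, 2)}:
            n[v] = n.get(v, 0) + 1
    overloaded = sum(1 for count in n.values() if count > k)
    lower = ceil(sqrt(2 * overloaded / 3))               # part (i), rounded up to an integer
    upper = max(0, ceil((len(genes) - k) / k)) * len(G)  # part (ii), as read above
    return n, lower, upper


# Hypothetical species tree: root r with children a and b; a has leaves x1, x2; b has leaves x3, x4.
parent = {"a": "r", "b": "r", "x1": "a", "x2": "a", "x3": "b", "x4": "b"}
G = {"x1": {1, 2}, "x2": {1, 3}, "x3": {2, 3}, "x4": {1}}
n, lower, upper = theorem1_bounds(parent, G, 2)
print(n, lower, upper)
```

Here all three genes have the root as the most recent common ancestor of some pair of carriers, so n(r) = 3 exceeds k = 2 and at least one transfer is required.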
2.3 Hardness results 

The next two results show that two fundamental decision questions concerning the existence of (G, k)-labellings
are NP-complete. First, consider the following problem:

Gene Labelling

Given: A phylogenetic network N on X, a finite set Q of genes, a genome assignment $G : X \to 2^Q$, and
a positive integer k.
Question: Does N exhibit a (G, k)-labelling?

Theorem 2. The decision problem Gene Labelling is NP-complete even if k = 1.

A related problem, but concerning rooted phylogenetic trees, is the following:

(G, k)-Tree

Given: A finite set X of taxa, a finite set Q of genes, a genome assignment $G : X \to 2^Q$, and a positive
integer k.
Question: Does there exist a rooted phylogenetic tree N on X that exhibits a (G, k)-labelling?

Theorem 3. The decision problem (G, k)-Tree is NP-complete.

The proofs of these two theorems are given in Section 4.

2.4 Algorithms 

Despite the apparent intractability of the two problems described above, there are instances for which there exist
polynomial-time algorithms. Several such instances are described in Section 5. One in particular is given next.

Let N be a phylogenetic network on X. A sequence of vertices and arcs is an underlying cycle of N if it
is a cycle of the underlying graph (i.e. the undirected graph obtained by ignoring the directions of the arcs). A
phylogenetic network N on X is a galled tree if, for each pair C and D of distinct underlying cycles, the vertex sets of C
and D are disjoint. Each such cycle is called a gall. Theorem 4 shows that when the phylogenetic networks
in Gene Labelling are restricted to galled trees, the decision problem becomes polynomial-time solvable.

Theorem 4. Let N be a galled tree on X, let Q be a set of genes, let $G : X \to 2^Q$ be a genome assignment, and
let k be a positive integer. Then there is a polynomial-time algorithm for deciding whether or not N exhibits a
(G, k)-gene labelling.

Theorem 4, together with the following corollary, is established in Section 5.

Corollary 1. Let T be a rooted phylogenetic tree on X, let Q be a set of genes, let $G : X \to 2^Q$ be a genome
assignment, and let k be a positive integer. If h is a fixed positive integer, then there is a polynomial-time algorithm
for deciding whether or not there is a galled tree N on X that can be obtained from T by adding at most h arcs
and which exhibits a (G, k)-gene labelling.

3 HOW MANY GENE TRANSFERS ARE NEEDED? 

In this section, we prove Theorem 1.

Proof of Theorem 1. For the proof of (i), suppose that a network N admitting a (G, k)-gene labelling can be
obtained by adding $\ell(T, G, k)$ arcs to T. It follows that there exists a tree T' that is a subdivision of T and a
subgraph of N. In other words, T' is an embedding of T in N. An arc of N is said to be an lgt-arc if it is not an
arc of T'. Consider two leaves $x_1, x_2$ and their lowest common ancestor v in T'. Suppose that for a gene $g \in Q$ we
have $g \in G(x_1)$ and $g \in G(x_2)$. Since network N admits a (G, k)-gene labelling F, there has to be an undirected
path from $x_1$ to $x_2$ in N containing only vertices u with $g \in F(u)$. Furthermore, at least one such undirected
path has to consist of two directed paths, one ending in $x_1$ and one ending in $x_2$, since the subgraph of N
induced by $\{v \in V : g \in F(v)\}$ is rooted and hence contains a rooted tree. There are four possibilities. Firstly,
it is possible that this undirected $x_1$-$x_2$ path passes through v, implying that $g \in F(v)$. The remaining three
cases are illustrated in Fig. 2. The first case is that the undirected $x_1$-$x_2$ path uses an lgt-arc (a, d) between two
vertices a, d that have v as their lowest common ancestor in T'. A second possibility is that the path uses two
lgt-arcs (a, b) and (c, d) such that v is the lowest common ancestor of a and d in T'. Finally, it is also possible
that the path uses two lgt-arcs (b, a) and (c, d) such that v is the lowest common ancestor of a and d in T'.
Now consider any vertex v with n(v) > k. Since $|F(v)| \leq k$, at least one of the n(v) genes counted at v is not in F(v),
so there has to be either an lgt-arc (a, d) or two lgt-arcs (a, b), (c, d) or two
lgt-arcs (b, a), (c, d), with a and d two vertices that have v as their lowest common ancestor in T'.

Given a vertex v of T, we say that an lgt-arc (s, t) satisfies v if v is the lowest common ancestor of s and t
in T'. Since in a tree there is a unique lowest common ancestor, each single lgt-arc satisfies at most one vertex.
Furthermore, we say that a pair of lgt-arcs {(s, t), (s', t')} satisfies v if v is the lowest common ancestor of either s
and t', or of s' and t, or of t and t' in T'. It follows directly that each pair of lgt-arcs satisfies at most three
vertices. Since there are $\ell(T, G, k)$ lgt-arcs, in total at most $3\binom{\ell(T, G, k)}{2} + \ell(T, G, k)$ vertices v with n(v) > k can
be satisfied. From the previous paragraph we know that each vertex v with n(v) > k needs to be satisfied, either
by a single lgt-arc or by a pair of lgt-arcs. It follows that there can be at most $3\binom{\ell(T, G, k)}{2} + \ell(T, G, k)$ vertices v
with n(v) > k. Part (i) follows by generously bounding $3\binom{\ell(T, G, k)}{2} + \ell(T, G, k)$ by $\tfrac{3}{2}\,\ell(T, G, k)^2$.
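
For completeness, the arithmetic behind this last step can be written out as follows, abbreviating $\ell(T, G, k)$ to $\ell$. This is a routine expansion added for the reader, not a new result.

```latex
3\binom{\ell}{2} + \ell
  \;=\; \frac{3\ell(\ell - 1)}{2} + \ell
  \;=\; \frac{3\ell^{2} - \ell}{2}
  \;\le\; \frac{3}{2}\,\ell^{2},
\qquad\text{hence}\qquad
|\{v \in V : n(v) > k\}| \;\le\; \frac{3}{2}\,\ell^{2}
\;\Longrightarrow\;
\ell \;\ge\; \sqrt{\tfrac{2}{3}\,|\{v \in V : n(v) > k\}|}.
```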




Figure 2. Illustration for the proof of Theorem 1. The three cases apply, without loss of generality, whenever
$g \in G(x_1)$, $g \in G(x_2)$, but $g \notin F(v)$, where v is the lowest common ancestor of $x_1$ and $x_2$ in T'. Straight lines
denote arcs, while curves denote paths. Solid curves are in T', while dotted lines/curves can be either in T' or
only in N.

For (ii), we can construct a network N admitting a (G, k)-gene labelling as follows. We select a set $Q^0$ of k
arbitrary genes in Q and set $F(v) = Q^0$ for each internal vertex v of T. The third property of a (G, k)-gene
labelling is now satisfied for the genes in $Q^0$. For the remaining $|Q| - k$ genes we do the following. We introduce
$f = \lceil (|Q| - k)/k \rceil$ additional isolated vertices $v_1, \ldots, v_f$ and label these vertices by disjoint sets $F(v_1), \ldots, F(v_f)$ that
partition $Q \setminus Q^0$ and contain at most k genes each. Finally, we add arcs from the root to each $v_i$ and from each $v_i$
to each leaf x with $G(x) \cap F(v_i) \neq \emptyset$. This leads to the claimed upper bound. □

Improving upon this simple upper bound turns out to be challenging. This can perhaps be explained by the
results in the next section, in which we show that, even if the network N is given and k = 1, it is NP-complete
to decide if a (G, k)-gene labelling of N exists.



4 UNRAVELLING LATERAL GENE TRANSFER IS HARD 

We begin this section by showing that Gene Labelling is NP-complete. First consider the following decision
problem:

Directed Acyclic Subgraph Homeomorphism (DASH)

Given: Directed acyclic graphs $D = (V_D, E_D)$ and $P = (V_P, E_P)$ with $V_P \subseteq V_D$.
Question: Is P homeomorphic to a subgraph of D?

A graph P is homeomorphic to a graph H if H can be obtained from P by replacing arcs (u, v) by internally
vertex-disjoint directed u-v paths. Hence, DASH can be seen as a disjoint-paths problem. The graph P is called
the "pattern graph". It was observed by Fortune et al. [5] that NP-hardness of DASH follows from a result of
Even, Itai, and Shamir [7] on multi-commodity flows.

Theorem 2. The decision problem Gene Labelling is NP-complete even if k = 1.

Proof. The reduction is from DASH. Let (D, P) be an instance of DASH. We begin by showing that we may
assume that, for each vertex u in P, we have $d_P^+(u) + d_P^-(u) = 1$. To see this, let D' and P' be the digraphs obtained
from D and P, respectively, by iteratively doing the following for each vertex v in P:

(i) Let $\{s_1, s_2, \ldots, s_i\}$ be the set of in-neighbours of v in P and let $\{t_1, t_2, \ldots, t_j\}$ be the set of out-neighbours of
v in P.

(ii) In P, replace v and the arcs $(s_1, v), \ldots, (s_i, v)$ and $(v, t_1), \ldots, (v, t_j)$ with the new vertices $v_1, v_2, \ldots, v_{i+j}$ and
the new arcs $(s_1, v_1), \ldots, (s_i, v_i)$ and $(v_{i+1}, t_1), \ldots, (v_{i+j}, t_j)$.

(iii) Let $\{x_1, x_2, \ldots, x_r\}$ be the set of in-neighbours of v in D and let $\{y_1, y_2, \ldots, y_s\}$ be the set of out-neighbours
of v in D.

(iv) In D, replace v and the arcs $(x_1, v), \ldots, (x_r, v)$ and $(v, y_1), \ldots, (v, y_s)$ with the new vertices $v_1, v_2, \ldots, v_{i+j}$
and the new arcs

$(x_1, v_1), (x_2, v_1), \ldots, (x_r, v_1), (x_1, v_2), (x_2, v_2), \ldots, (x_r, v_2), \ldots, (x_1, v_i), (x_2, v_i), \ldots, (x_r, v_i)$

and

$(v_{i+1}, y_1), (v_{i+1}, y_2), \ldots, (v_{i+1}, y_s), (v_{i+2}, y_1), (v_{i+2}, y_2), \ldots, (v_{i+2}, y_s), \ldots, (v_{i+j}, y_1), (v_{i+j}, y_2), \ldots, (v_{i+j}, y_s)$.

At the end of this iterative construction, for each vertex u in P', we have $d_{P'}^+(u) + d_{P'}^-(u) = 1$. Moreover, it is
straightforward to check that P' is homeomorphic to a subgraph of D' if and only if P is homeomorphic to a
subgraph of D. It now follows that we may assume that our given instance (D, P) of DASH is of the form at the
completion of this construction.

We next describe a polynomial-time transformation of our instance (D, P) of DASH into an instance of Gene
Labelling with k = 1. We define N, X, Q, and the function $G : X \to 2^Q$ iteratively as follows. Initially,
set X and Q to be both empty. Let N be the phylogenetic network obtained from D = (V, A) by applying the
following sequence of operations:

(O-I) For each arc a = (u, v) of P, add a new gene $g_a$ to Q, add new leaf vertices $\ell_u, \ell_v$ to V and to X, add new
arcs $(u, \ell_u)$ and $(v, \ell_v)$ to A, and set $G(\ell_u) = G(\ell_v) = \{g_a\}$. Furthermore, delete all incoming arcs of u from
A. At the end of (O-I), the constructions of the sets X and Q, and the function $G : X \to 2^Q$ are completed.

(O-II) Repeatedly remove all leaves of the resulting network not in X and repeatedly remove all vertices of indegree
zero that do not have an element of X as a child.

(O-III) Finally, root the resulting network by choosing a vertex of indegree zero as a root and then adding an arc
from this root to each other vertex of indegree zero. Setting N to be the resulting phylogenetic network on
X, we have now constructed the desired instance of Gene Labelling.

An example of this construction is shown in Fig. 3. Note that, while D may not be connected, N is connected
because of (O-III). We complete the proof by showing that N admits a (G, 1)-gene labelling if and only if P is
homeomorphic to a subgraph of D.




Figure 3. An example of the reduction in the proof of Theorem 2. From an instance (P, D) of DASH, a phylogenetic
network N is constructed with leaf-labelling $G(\ell_u) = G(\ell_v) = \{g_a\}$ and $G(\ell_{u'}) = G(\ell_{v'}) = \{g_{a'}\}$. Disjoint
paths $u \to w_2 \to v$ and $u' \to w_4 \to v'$ in D correspond to a labelling $F(\ell_u) = F(u) = F(w_2) = F(v) = F(\ell_v) = \{g_a\}$,
$F(\ell_{u'}) = F(u') = F(w_4) = F(v') = F(\ell_{v'}) = \{g_{a'}\}$.

Suppose that P is homeomorphic to a subgraph of D. Then, for each arc a = (u, v) of P, there exists a
directed u-v path in D such that all these directed paths are pairwise vertex disjoint. We first claim that
for each such u-v path in D, there exists a corresponding u-v path in N. To see this, observe that, in the
construction of N from D, the only arcs deleted are those arcs directed into a vertex, u say, for which u is a vertex
in P, and arcs incident with a vertex, w say, for which either there is no directed path from w to a vertex in X
or there is no directed path from a parent of a vertex in X to w.

None of these deletions deletes an arc on any u-v path in D and so the claim holds. Now, for each arc a = (u, v)
of P and for each vertex w on the associated u-v path in N, set $F(w) = \{g_a\}$. Since the children $\ell_u$ of u and $\ell_v$ of v
are the only other vertices with a label containing $g_a$, the subgraph of N = (V, A) induced by $\{w \in V : g_a \in F(w)\}$
is rooted and connected. Labelling all remaining vertices w by $F(w) = \emptyset$ thus leads to a (G, 1)-gene labelling
of N.

Now suppose that F is a (G, 1)-gene labelling of N. It remains to show that P is homeomorphic to a subgraph
of D. Consider a gene $g_a \in Q$, and let a = (u, v) be the associated arc of P. Since F is a (G, 1)-gene labelling, the
subgraph of N induced by $\{w \in V(N) : g_a \in F(w)\}$ is connected. Furthermore, each of the arcs added in (O-III)
in the construction of N joins two vertices that are assigned distinct genes in Q by F as F is a (G, 1)-labelling of
N. Thus none of these arcs are contained in the subgraph of N induced by $\{w \in V(N) : g_a \in F(w)\}$. Since u has
no other incoming arcs, u has indegree zero in this subgraph. Since the child $\ell_v$ of v is also labelled $F(\ell_v) = \{g_a\}$,
it follows that N contains a directed path from u to v whose vertices are assigned $\{g_a\}$ under F. This path is also
a directed path in D. Moreover, for two distinct genes $g_a, g_b \in Q$, these paths are pairwise disjoint and so they
are pairwise disjoint in D. The union of these paths in D forms a subgraph H of D such that P is homeomorphic
to H. This completes the proof of the theorem. □

We turn now to the proof of Theorem 3, which is based on the concepts of tree-width and tree-decomposition
from graph theory. We define these notions now; for further background the interested reader may wish to
consult [3].

A tree decomposition of a graph $H = (V_H, E_H)$ is a pair $(T, \{X_i : i \in I\})$ where $T = (I, E_T)$ is a tree and, for
all $i \in I$, the set $X_i$ is a subset of $V_H$ such that:

(i) $\bigcup_{i \in I} X_i = V_H$;

(ii) for each $(u, v) \in E_H$, there exists an $i \in I$ with $u, v \in X_i$;

(iii) for each $v \in V_H$, the subgraph of T induced by $\{i \in I : v \in X_i\}$ is connected.

The width of the tree decomposition is defined as $\max_{i \in I} |X_i| - 1$.
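
The three conditions and the width can be verified directly. The following minimal sketch is an illustration only: the function name `decomposition_width` and the data layout (bags as a dictionary from tree nodes to vertex sets, tree edges as pairs of nodes) are assumptions, and the code takes for granted that `tree_edges` really forms a tree.

```python
def decomposition_width(vertices, edges, tree_edges, bags):
    """Return the width if (tree_edges, bags) is a tree decomposition of the graph, else None."""
    # (i): the bags cover the vertex set.
    if set().union(*bags.values()) != set(vertices):
        return None
    # (ii): both ends of every edge lie together in some bag.
    if not all(any({u, v} <= bag for bag in bags.values()) for u, v in edges):
        return None
    # (iii): the tree nodes whose bags contain v induce a connected subtree.
    for v in vertices:
        nodes = {i for i, bag in bags.items() if v in bag}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for a, b in tree_edges:
                for p, q in ((a, b), (b, a)):
                    if p == i and q in nodes and q not in seen:
                        seen.add(q)
                        stack.append(q)
        if seen != nodes:
            return None
    return max(len(bag) for bag in bags.values()) - 1


# Hypothetical example: the 4-cycle a-b-c-d-a with two bags of size three.
cycle_vertices = ["a", "b", "c", "d"]
cycle_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
width = decomposition_width(cycle_vertices, cycle_edges, [(1, 2)],
                            {1: {"a", "b", "c"}, 2: {"a", "c", "d"}})
print(width)
```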

We use the following NP-complete problem for the reduction in the proof of the theorem.

Treewidth

Given: An undirected graph $H = (V_H, E_H)$ and a natural number k'.
Question: Does there exist a tree decomposition of H with width at most k'?

Theorem 3. The decision problem (G, k)-Tree is NP-complete.

Proof. The reduction is from Treewidth. Let (H, k') be an instance of Treewidth, and set $X = E_H$, $Q = V_H$,
$G(x) = \{u, v\}$ for each edge $x = \{u, v\} \in E_H$, and k = k' + 1. We complete the proof by showing that there
exists a tree decomposition of H with width at most k' if and only if there exists a phylogenetic tree N on X
that admits a (G, k)-gene labelling.
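
The instance transformation itself is short enough to write down. The sketch below is illustrative only; the function name `treewidth_to_gk_tree` and the encoding of taxa as frozensets of edge endpoints are assumptions made for this example.

```python
def treewidth_to_gk_tree(vertices, edges, k_prime):
    """Map a Treewidth instance (H, k') to a (G, k)-Tree instance (X, Q, G, k)."""
    X = [frozenset(edge) for edge in edges]   # one taxon per edge of H
    Q = set(vertices)                         # one gene per vertex of H
    G = {x: set(x) for x in X}                # each taxon carries its two end vertices
    return X, Q, G, k_prime + 1


# Hypothetical example: the 4-cycle a-b-c-d-a with k' = 2.
X, Q, G, k = treewidth_to_gk_tree(
    ["a", "b", "c", "d"],
    [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")],
    2,
)
print(len(X), sorted(Q), k)
```

Every genome in the resulting instance has exactly two genes, so the difficulty lies entirely in choosing the tree and its internal labels.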

Firstly, let $(T, \{X_i : i \in I\})$ be a tree decomposition of H with width k'. For each $\{u, v\} \in E_H$, there exists
an $i \in I$ with $u, v \in X_i$. Hence, for each taxon $x \in X$, there exists a vertex i of T with $G(x) \subseteq X_i$. We construct N
from T by choosing an arbitrary vertex as a root, directing all edges away from the root and, for each $x \in X$,
adding a leaf x and an arc (i, x) where i is an arbitrary vertex of T with $G(x) \subseteq X_i$. Repeatedly deleting leaves
not in X, set N to be the resulting rooted phylogenetic tree on X. We can now obtain a (G, k)-gene labelling F
of N by setting $F(x) = G(x)$ for each leaf $x \in X$ and $F(i) = X_i$ for each other vertex. For each gene $g \in Q$, the
subgraph of N = (V, A) induced by $\{v \in V : g \in F(v)\}$ is connected by property (iii) of a tree decomposition,
and is rooted as N is a rooted phylogenetic tree.

Now suppose that there exists a phylogenetic tree N on X and a (G, k)-gene labelling F of N = (V, A). Then
we can obtain a tree decomposition $(T, \{X_i : i \in I\})$ of H by setting I = V and $X_i = F(i)$ for all $i \in V$, and
defining T to be the tree obtained from N by ignoring the rooting and thus orientation of each of the arcs. All
properties of a tree decomposition are clearly satisfied, and the width is at most k' = k - 1 because $|F(i)| \leq k$ by
the definition of a (G, k)-gene labelling. □

5 ... BUT SOMETIMES IT IS EASY

Let N be a galled tree on X, let Q be a set of genes, let $G : X \to 2^Q$ be a genome assignment, and let k be
a positive integer. The main result of this section shows that there is a polynomial-time algorithm for deciding
whether N exhibits a (G, k)-labelling. If N is a phylogenetic tree, then this problem is equivalent to deciding if
$\ell(N, G, k) = 0$.

Proposition 1. Let T be a phylogenetic tree on X, let Q be a set of genes, let $G : X \to 2^Q$ be a genome assignment,
and let k be a positive integer. Then there is a polynomial-time algorithm for deciding whether $\ell(T, G, k) = 0$.

Proof. Deciding whether $\ell(T, G, k) = 0$ is equivalent to deciding if T has a (G, k)-gene labelling. With this in
mind, it is easily seen that the following G-gene labelling function F of T minimizes k. For all $v \in V$, the gene
$g \in Q$ is in F(v) precisely if v is a vertex of the minimal subtree of T that connects those leaves x for which
$g \in G(x)$. If $|F(v)| \leq k$ for all $v \in V$, then F is a (G, k)-gene labelling; otherwise there is no such gene labelling of T. □
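
The labelling used in this proof is easy to compute. The following minimal sketch assumes the tree is given as a hypothetical child-to-parent map `parent` and that the genome assignment `G` is a dictionary on the leaves; the function names are illustrative. For each gene it marks the paths from every carrying leaf up to the lowest common ancestor of all carriers, which together form the minimal connecting subtree.

```python
def minimal_labelling(parent, G):
    """F(v) holds gene g exactly when v lies on the minimal subtree connecting the leaves carrying g."""
    def path_to_root(v):
        path = [v]
        while v in parent:
            v = parent[v]
            path.append(v)
        return path

    F = {v: set() for v in set(parent) | set(parent.values())}
    for g in set().union(*G.values()):
        paths = [path_to_root(x) for x in G if g in G[x]]
        common = set(paths[0]).intersection(*paths[1:])
        top = next(v for v in paths[0] if v in common)   # lowest common ancestor of all carriers
        for path in paths:
            for v in path[: path.index(top) + 1]:
                F[v].add(g)
    return F


def ell_is_zero(parent, G, k):
    """Decide whether the tree itself exhibits a (G, k)-gene labelling."""
    return all(len(genes) <= k for genes in minimal_labelling(parent, G).values())


# Hypothetical species tree: root r with children a and b; a has leaves x1, x2; b has leaves x3, x4.
parent = {"a": "r", "b": "r", "x1": "a", "x2": "a", "x3": "b", "x4": "b"}
G = {"x1": {1, 2}, "x2": {1, 3}, "x3": {2, 3}, "x4": {1}}
F = minimal_labelling(parent, G)
print(F["r"], ell_is_zero(parent, G, 3), ell_is_zero(parent, G, 2))
```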

Proposition 2 (below) establishes the main result when N has exactly one gall. We will use this proposition
as the base case for an inductive proof of the main result. The proof of this proposition relies on the following
construction. Let N be a galled tree on X with exactly one gall. Thus the undirected graph underlying N has
exactly one cycle. Label (in order) the vertices of this cycle $w_1, w_2, \ldots, w_p$, where $w_p$ is the unique vertex in N
with two arcs directed into it.

Let F* be the following map from the vertex set V of N to $2^Q$. For each $v \in V$, the gene $g \in Q$ is in F*(v)
precisely if, ignoring the direction of the arcs, either:

(i) there is a pair of leaves $x_1$ and $x_2$ with $g \in G(x_1)$ and $g \in G(x_2)$, and v is on a path between $x_1$ and $x_2$ that
avoids $w_p$; or

(ii) there is a pair of leaves $x_1$ and $x_2$ with $g \in G(x_1)$ and $g \in G(x_2)$, and v is on all paths between $x_1$ and $x_2$.

The following two observations are important for what follows. First, if F is a G-gene labelling of N, then it is
easily seen that $F^*(v) \subseteq F(v)$ for all $v \in V$. Second, F* is not necessarily a G-gene labelling of N. The exact
reason for this is that there can be a gene $g \in Q$ such that the sub-digraph of N induced by $\{v \in V : g \in F^*(v)\}$
consists of two rooted connected components; one lying below $w_p$ (more precisely, in the subgraph of N induced
by the vertices that are reachable from $w_p$ by a directed path) and one lying above $w_p$ (more precisely, in the
subgraph of N induced by the vertices that are not reachable from $w_p$ by a directed path).

Now let Q' be the subset of genes $g \in Q$ for which the sub-digraph of N induced by $\{v \in V : g \in F^*(v)\}$ is
disconnected. We extend F* to a G-gene labelling F of N by reformulating the problem as an undirected network
flow problem and then using its solution to identify the extension. Here one can view each edge {a, b} as the two
arcs (a, b) and (b, a). We construct an undirected graph U from N by starting with the sub-digraph of N induced
by $\{w_1, w_2, \ldots, w_p\}$ and ignoring the direction of the arcs, adding a source vertex s, and, for each gene $g \in Q'$,
adding a new vertex $s_g$ and the three edges $\{s, s_g\}$, $\{s_g, w_{i_1 - 1}\}$, and $\{s_g, w_{i_2 + 1}\}$, where $i_1$ and $i_2$ are the smallest
and largest index $i \neq p$ for which $g \in F^*(w_i)$. Now assign each $s_g$ capacity 1 and, for each $i \in \{1, 2, \ldots, p - 1\}$,
assign $w_i$ capacity $k - |F^*(w_i)|$.

To illustrate the above construction, consider the galled tree N shown in Fig. 4(a). Each leaf x of N is labelled
by the set G(x) of input genes observed in the corresponding taxon. The map F* is shown in Fig. 4(b). The
undirected graph U with k = 3 is shown in Fig. 4(c).

Figure 4. (a) A galled tree N with one gall. Each leaf x of N is labelled by G(x). (b) The initial labelling F*
in which, for example, the sub-digraph of N induced by $\{v \in V : 4 \in F^*(v)\}$ (displayed by the dashed arcs and
their end vertices) consists of two connected components. (c) Auxiliary graph U with capacities in parentheses.
(d) A (G, 3)-gene labelling of N. This gene labelling corresponds to a maximum flow in U which sends one unit
of flow through $s_1$ and $w_1$ and one unit of flow through $s_4$ and $w_4$.



Lemma 1. There exists an integer flow in U from s to $w_p$ with value $|Q'|$ if and only if there exists a (G, k)-gene
labelling of N. Moreover, if there is such an integer flow, then it leads to a (G, k)-gene labelling of N.

Proof. First suppose that there exists such a flow f with value $|Q'|$. Based on f, we show that there exists a
(G, k)-labelling F of N. For this existence proof, we assume that we know the path that each unit of flow takes.
We will conclude the proof by showing how an actual (G, k)-labelling can be constructed.

Initially set F = F*. Since f has value $|Q'|$ and each $s_g$ has capacity 1, there is exactly one unit of flow
passing through $s_g$ from s to $w_p$. Furthermore, as f is integer, it uses exactly one of the two edges $\{s_g, w_{i_1 - 1}\}$
and $\{s_g, w_{i_2 + 1}\}$. If f uses $\{s_g, w_{i_1 - 1}\}$, then the corresponding unit of flow either uses the vertices on the path
from $w_{i_1 - 1}$ to $w_p$ through $\{w_1, w_p\}$ or the vertices on the path from $w_{i_1 - 1}$ to $w_p$ through $\{w_{p-1}, w_p\}$. Depending
on which of these paths this unit of flow takes, add g to $F(w_i)$ for each of the vertices $w_i$ on this path. Similarly,
if f uses $\{s_g, w_{i_2 + 1}\}$, then the corresponding unit of flow either uses the vertices on the path from $w_{i_2 + 1}$ to $w_p$
through $\{w_1, w_p\}$ or the vertices on the path from $w_{i_2 + 1}$ to $w_p$ through $\{w_{p-1}, w_p\}$. Depending on which of these
paths this unit of flow takes, add g to $F(w_i)$ for each of the vertices $w_i$ on this path. Doing this for each $g \in Q'$,
we claim that the resulting map $F : V \to 2^Q$ is a (G, k)-labelling of N. Clearly, F satisfies (III). Furthermore,
as each vertex $w_i$ has capacity $k - |F^*(w_i)|$, the cardinality of $F(w_i)$ is at most k. Thus F satisfies (II). It now
follows that F is a (G, k)-labelling of N.

Now suppose that there exists a (G, k)-gene labelling F of N. By one of the two observations earlier, $F^*(v) \subseteq F(v)$
for all $v \in V$. Consider a gene $g \in Q'$. The sub-digraph of N induced by $\{v \in V : g \in F^*(v)\}$ consists of two
rooted connected components. However, by (III), the sub-digraph of N induced by $\{v \in V : g \in F(v)\}$ is rooted
and connected. Therefore, there is a path on the cycle consisting of vertices $w_i$ with $g \in F(w_i) - F^*(w_i)$ that
connects the two components. Sending one unit of flow from s to the first vertex on this path via $s_g$, and then
along this path to $w_p$ for each $g \in Q'$ gives a desired integer flow.

We have now shown that if there is an integer flow / from s to w p with value \Q'\, then there is a (G,k)- 
labelling of N . This does not directly give such a labelling as we can make no distinction on the flow units. In 
particular, it is not directly clear which of the two paths a flow unit takes once it reaches a vertex Wi in the cycle. 
This can be rectified as follows. Let $f$ be such a flow and let $g \in Q'$. Ignoring the vertices $w_{i_1}, \ldots, w_{i_2}$, the flow unit through $s_g$ takes either the path from $w_{i_1-1}$ to $w_p$ via $w_1$ or the path from $w_{i_2+1}$ to $w_p$ via $w_{p-1}$. To make this decision, consider the following modification of the integer flow problem. Extend $F^*$ to $\hat{F}^*$ by adding $g$ to each of $F^*(w_{i_1-1}), \ldots, F^*(w_1)$ and, for each of these vertices, subtract one from its capacity. If there is an integer flow from $s$ to $w_p$ in $U \setminus s_g$ of $|Q'| - 1$ units, then we may assume that the unit of flow through $s_g$ in $U$ follows the path from $w_{i_1-1}$ to $w_p$ via $w_1$. In this case, replace $F^*$ with $\hat{F}^*$ and $U$ with $U \setminus s_g$, and repeat for another element of $Q' - g$. If there is no such integer flow in $U \setminus s_g$, then the unit of flow through $s_g$ in $U$ follows the path from $w_{i_2+1}$ to $w_p$ via $w_{p-1}$. In this second case, replace $F^*$ with the labelling obtained by adding $g$ to each of $F^*(w_{i_2+1}), \ldots, F^*(w_{p-1})$ and, for each of these vertices, subtract one from its capacity, and replace $U$ with $U \setminus s_g$. Continuing in this way, we eventually obtain a $(G, k)$-labelling of $N$. □
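
The flow computations used in this proof are standard. As a minimal sketch, the following Python code computes an integer maximum flow in a network with vertex capacities, by splitting each vertex into an "in" and an "out" copy and repeatedly augmenting along shortest paths. It is not the construction of $U$ from the lemma: the function names and the toy network (a source `s`, two gene vertices, two cycle vertices and a sink `w_p`) are illustrative assumptions, and the simple augmenting-path method shown is not the faster algorithm cited for the running-time bound.

```python
from collections import defaultdict, deque


def split_vertex_capacities(arcs, vertex_cap):
    """Encode vertex capacities as arc capacities: v becomes (v,'in') -> (v,'out')."""
    big = sum(vertex_cap.values()) + 1  # effectively unbounded arc capacity
    cap = defaultdict(dict)
    for v, c in vertex_cap.items():
        cap[(v, "in")][(v, "out")] = c
    for u, v in arcs:
        cap[(u, "out")][(v, "in")] = big
    return cap


def max_flow(capacity, source, sink):
    """Integer maximum flow by shortest augmenting paths (Edmonds-Karp)."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, outs in capacity.items():
        for v, c in outs.items():
            residual[u][v] += c
    total = 0
    while True:
        # Breadth-first search for an augmenting path in the residual network.
        pred = {source: None}
        queue = deque([source])
        while queue and sink not in pred:
            u = queue.popleft()
            for v, c in list(residual[u].items()):
                if c > 0 and v not in pred:
                    pred[v] = u
                    queue.append(v)
        if sink not in pred:
            return total
        # Recover the path and push the bottleneck amount along it.
        path, v = [], sink
        while pred[v] is not None:
            path.append((pred[v], v))
            v = pred[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        total += push


# Hypothetical toy network, not the example of Fig. 4.
toy_arcs = [("s", "s_g1"), ("s", "s_g2"), ("s_g1", "w_1"), ("s_g2", "w_1"),
            ("s_g2", "w_2"), ("w_1", "w_p"), ("w_2", "w_p")]
toy_caps = {"s": 2, "s_g1": 1, "s_g2": 1, "w_1": 1, "w_2": 2, "w_p": 2}
toy_value = max_flow(split_vertex_capacities(toy_arcs, toy_caps),
                     ("s", "in"), ("w_p", "out"))
```

Re-running such a computation after lowering some vertex capacities by one is the kind of test the proof uses to decide which of the two paths a given flow unit can be assumed to take.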

To illustrate Lemma 1 and its proof, consider the example prior to the lemma, illustrated in Fig. 4. In $U$, a maximum flow could send either two units of flow through $w_1$, or one unit of flow through vertex $w_1$ and one unit of flow through a second vertex of the cycle. From the latter option one can obtain, for example, the $(G, 3)$-gene labelling shown in the accompanying figure.

Proposition 2. Let $N$ be a phylogenetic network on $X$, let $Q$ be a set of genes, let $G : X \to 2^{Q}$ be a genome assignment, and let $k$ be a positive integer.

(i) If $N$ is a galled tree with exactly one gall, then there is a polynomial-time algorithm for deciding whether $N$ exhibits a $(G, k)$-labelling, in which case such a labelling can also be found in polynomial time.

(ii) If $T$ is a phylogenetic tree, then there is a polynomial-time algorithm for deciding whether $\ell(T, G, k) = 1$.

Proof. First note that a maximum-valued integer flow can be found in $O(n^{1.5} \log(n \cdot k))$ time [B]. Thus, (i) follows from Lemma 1. For (ii), if $|X| = n$, then there are $O(n^2)$ possible ways of adding a single arc to $T$. Applying Lemma 1 to each such way gives the desired algorithm. This completes the proof of the proposition. □
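
The $O(n^2)$ count in part (ii) can be made concrete with a short enumeration. The sketch below assumes that a new arc is added by subdividing two distinct arcs of $T$ and joining the two new vertices, and it discards only those choices that would create a directed cycle; the precise admissibility conditions are those of the paper's definition of a phylogenetic network, and the function name and tree encoding are hypothetical.

```python
from itertools import permutations


def candidate_arc_additions(children):
    """List ordered pairs (tail_arc, head_arc) of tree arcs between which one
    new arc could be added without creating a directed cycle.

    `children` maps each internal vertex to the list of its children.
    """
    arcs = [(u, v) for u, kids in children.items() for v in kids]
    parent = {v: u for u, v in arcs}

    def ancestors_or_self(v):
        found = {v}
        while v in parent:
            v = parent[v]
            found.add(v)
        return found

    candidates = []
    for (a, b), (c, d) in permutations(arcs, 2):
        # The new arc runs from a point on (a, b) to a point on (c, d).
        # If d is a or an ancestor of a, the new arc closes a directed cycle.
        if d in ancestors_or_self(a):
            continue
        candidates.append(((a, b), (c, d)))
    return candidates


# Toy tree with leaves a, b, c: root r has children u and c; u has children a and b.
toy_tree = {"r": ["u", "c"], "u": ["a", "b"]}
toy_candidates = candidate_arc_additions(toy_tree)
```

Since a tree on $n$ leaves has $O(n)$ arcs, the number of ordered pairs examined is $O(n^2)$, and each surviving candidate yields a network with a single gall to which Lemma 1 can be applied.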

We now extend Proposition 2(i) to all galled trees using induction on the number of galls. Let $N$ be a galled tree on $X$, let $Q$ be a set of genes, let $G : X \to 2^{Q}$ be a genome assignment, and let $k$ be a positive integer. If $N$ has either no galls or exactly one gall, then we have such an algorithm by Propositions 1 and 2, so we may assume that $N$ has at least two galls. In this case there exists a vertex $u_1$ of $N$ with the property that, for some gall, every vertex of this gall is a descendant of $u_1$, and no proper descendant of $u_1$ has this property. Let $N_1$ be the phylogenetic network obtained from $N$ by replacing $u_1$ and all of its descendants with a single vertex $q_1$. Let $Q_1$ be the phylogenetic network obtained from $N$ by deleting all of the vertices of $N$ that are not descendants of $u_1$ and adjoining to $u_1$ a new parent vertex that has one further child other than $u_1$. Call the additional child vertex $v_1$. Let $L_{Q_1}$ denote the leaf set of $Q_1$. Effectively, we have partitioned $N$ into two phylogenetic networks $N_1$ and $Q_1$. See Figure 5 for an example. Let

$$G(q_1) = G(v_1) = \Bigl(\bigcup_{x \in L_{Q_1} - \{v_1\}} G(x)\Bigr) \cap \Bigl(\bigcup_{x \in X - L_{Q_1}} G(x)\Bigr).$$
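
As a small illustration of this definition, the following Python sketch computes the genome assigned to the two new leaves $q_1$ and $v_1$: the genes that occur both in some leaf below $u_1$ and in some leaf elsewhere in $N$. The function name and the toy genome assignment are assumptions made for the example and are not taken from Figure 5.

```python
def boundary_genome(genomes, leaves_below_u1):
    """Genes present both below u_1 and outside it.

    `genomes` maps each leaf of N to its set of genes; `leaves_below_u1`
    lists the leaves of Q_1 other than v_1.
    """
    below = set(leaves_below_u1)
    genes_below = set().union(*(genomes[x] for x in below))
    genes_outside = set().union(*(genomes[x] for x in genomes if x not in below))
    return genes_below & genes_outside


# Hypothetical genome assignment on three leaves.
toy_genomes = {"a": {1, 4}, "b": {2, 3}, "c": {1, 2, 5}}
toy_boundary = boundary_genome(toy_genomes, ["a", "b"])
```

In this toy case genes 1 and 2 occur on both sides of the cut, so they are exactly the genes placed on the new leaves.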

The proof of the following lemma is straightforward, and so the details are omitted. 




Figure 5. A galled tree $N$ and the decomposition of $N$ into $Q_1$ and $N_1$ described in the text. The leaf genomes shown are {1,4,7}, {2,3}, {2,3,4}, {1,7}, {1,4,5,6}, {6}, {4,5,6}, {4,6}, {1,4,5}.



Lemma 2. The galled tree $N$ has a $(G, k)$-labelling if and only if each of $N_1$ and $Q_1$ has a $(G, k)$-labelling.

By Proposition 2(i), there is a polynomial-time algorithm for deciding whether or not $Q_1$ has a $(G, k)$-labelling. If there is no such labelling, then, by Lemma 2, $N$ has no $(G, k)$-labelling. On the other hand, if $Q_1$ has a $(G, k)$-labelling, then one needs to check if $N_1$ has a $(G, k)$-labelling. Now repeat the above construction with $N$ replaced by $N_1$. Continuing in this way, we either find a galled tree with a single gall that does not exhibit a $(G, k)$-labelling, and thereby show that $N$ has no such labelling, or we find no such galled tree and conclude that $N$ has a $(G, k)$-labelling. Note that the number of galls in $N$ is polynomial in the size of the vertex set of $N$. In particular, we have established the following results.

Theorem 4. Let $N$ be a galled tree on $X$, let $Q$ be a set of genes, let $G : X \to 2^{Q}$ be a genome assignment, and let $k$ be a positive integer. Then there is a polynomial-time algorithm for deciding whether $N$ exhibits a $(G, k)$-gene labelling.
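
The control flow of the procedure behind Theorem 4 can be sketched as follows. This is only an outline under stated assumptions: the three callables are hypothetical placeholders for counting galls, for the decomposition of $N$ into $Q_1$ and $N_1$ described in the text (including the genome assigned to the new leaves), and for the decision procedure for networks with at most one gall. The toy stand-ins at the end merely exercise the loop.

```python
def galled_tree_has_labelling(network, count_galls, split_lowest_gall,
                              has_labelling_small):
    """Decide existence of a labelling by peeling off one lowest gall at a time.

    Hypothetical placeholders supplied by the caller:
      count_galls(N)         -> number of galls of N
      split_lowest_gall(N)   -> the pair (Q_1, N_1)
      has_labelling_small(M) -> decision for a network with at most one gall
    """
    while count_galls(network) >= 2:
        q1, n1 = split_lowest_gall(network)
        if not has_labelling_small(q1):
            return False   # by Lemma 2, N then has no labelling
        network = n1       # otherwise continue with the remaining network
    return has_labelling_small(network)


# Toy stand-in: a "network" is a list holding one boolean per gall, recording
# whether the corresponding single-gall piece admits a labelling.
toy_count = len
toy_split = lambda net: (net[-1:], net[:-1])
toy_small = all
```

Each pass removes one gall, so the number of calls to the single-gall procedure is at most the number of galls of $N$.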

Corollary 2. Let $T$ be a rooted phylogenetic tree on $X$, let $Q$ be a set of genes, let $G : X \to 2^{Q}$ be a genome assignment, and let $k$ be a positive integer. If $h$ is a fixed non-negative integer, then there is a polynomial-time algorithm for deciding whether or not there is a galled tree $N$ on $X$ that can be obtained from $T$ by adding at most $h$ arcs and which exhibits a $(G, k)$-gene labelling.

Proof. Suppose that $N$ is a galled tree on $X$ that can be obtained from $T$ by adding at most $h$ arcs. Then there is an embedding $T'$ of $T$ in $N$. Notice that since $N$ is a galled tree, it follows that all vertices of $N$ are contained in $T'$, and thus that $N$ can be obtained from $T$ by subdividing at most $2h$ arcs and adding at most $h$ arcs.

Hence, given $T$, we can try each possible way of subdividing at most $2h$ arcs and adding at most $h$ arcs. For each such possibility, we check if the resulting network is a galled tree. In each such case we can check if a $(G, k)$-gene labelling of this network exists, by Theorem 4. The time needed is polynomial in the size of the input, for each fixed $h$. □



6 CONCLUDING COMMENTS 

The analysis of this paper rests on a number of assumptions concerning gene evolution. Perhaps the most restrictive is the requirement that gene genesis is a unique event. This requirement reflects the fact that a gene
is typically a long and fairly precise sequence of nucleotides, and the probability that a similar sequence could 
evolve independently in a different part of the tree is small. This seems reasonable if DNA sequence evolution 
is described by a neutral model [11], but, in some cases, natural selection will, no doubt, direct the evolution
of DNA sequences towards certain genes that confer higher fitness. Thus, simple arguments based on neutrality 
need to be treated with caution. It would be interesting to extend the analysis of this paper to allow for a small 
frequency of independent gene genesis events. 

A related question is what degree of sequence similarity is required in order to classify two sequences as coding 
for the same gene. Insisting on exact sequence identity is too severe, since it is well known that different species 
typically encode a gene with slightly different sequences that result from random site substitutions (indeed these 
differences have been the main signal used for phylogenetic tree reconstruction [5]). This question of gene identity
is also relevant to the probability of independent gene genesis: a region of DNA that codes for a gene could, in 
principle, accumulate sufficient site mutations to put it just outside the range of being identified with that gene, 
but could then mutate back within range, giving the appearance of a second gene genesis event. 

Other aspects of the model that may be criticized are the assumptions that the species tree is known with 
certainty (or, indeed, that it is meaningful to talk of a 'species tree' [4]), and that the model does not penalize 
gene losses at all. 

Our computational complexity results highlight that many problems are surprisingly difficult, even for a tree, and some questions still remain to be explored further. One that seems particularly interesting is described as follows, along with our conjecture as to its possible resolution.

Given a rooted phylogenetic tree $T$, a set of genes $G(x)$ for each leaf $x$ of $T$, and natural numbers $k$ and $h$, consider the problem of deciding whether it is possible to add at most $h$ arcs to $T$ to obtain a phylogenetic network $N$ that admits a $(G, k)$-gene labelling.

Conjecture 1. This problem is NP-hard in general, but for each fixed h, it admits a polynomial-time algorithm.